!jupyter nbconvert Project3_data_3.ipynb --to html
In this project, we are going to clean and process the following dataset.
We import pandas to work with our data, Matplotlib to plot charts, Seaborn to style them, Plotly to draw interactive maps, and dython to measure associations between categorical features.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as ss
import seaborn as sns
import plotly.offline as py
import plotly.graph_objs as go
from dython.nominal import associations
color = sns.color_palette()
sns.set(style="darkgrid")
Let's load "2016collisionsfinal.csv", which has been provided in the datasets for the course.
rawdf = pd.read_csv('2016collisionsfinal.csv')
rawdf.head()
The first column, Record, is just a row index for the collision records. We drop it; keep in mind that each remaining row represents one recorded collision:
rawdf = rawdf.drop(['Record'], axis=1 )
rawdf.head()
print('The dataset 2016collisionsfinal has {} rows and {} features'.format(rawdf.shape[0],rawdf.shape[1]))
print('The list names of the features are : {} '.format(rawdf.columns.values))
print('Information about the type features:')
rawdf.info()
The data has 2 numerical features, 'X' and 'Y'; however, they are originally stored as strings.
The data has 2 date/time features, 'Date' and 'Time'; they are also stored as strings.
The data has 8 categorical features: 'Location', 'Environment', 'Road_Surface', 'Traffic_Control', 'Collision_Location', 'Light', 'Collision_Classification' and 'Impact_type'.
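Since 'X', 'Y', 'Date' and 'Time' all arrive as strings, we will eventually need to convert them. The snippet below is a minimal sketch of those conversions on a toy frame (the sample values are made up; the real conversions are applied to `df` later in the notebook):

```python
import pandas as pd

# Hypothetical sample mimicking the raw string columns.
toy = pd.DataFrame({
    'X': ['367,250.1', '368,001.9'],      # thousands separators -> float
    'Date': ['2/17/16', '12/24/16'],      # m/d/yy
    'Time': ['08:30', '00:00'],           # hh:mm
})

# Strip the comma separator, then cast to float.
toy['X'] = toy['X'].str.replace(',', '', regex=False).astype(float)

# Combine Date and Time into a single datetime column.
toy['When'] = pd.to_datetime(toy['Date'] + ' ' + toy['Time'],
                             format='%m/%d/%y %H:%M')
print(toy.dtypes)
```

The same pattern works on the real columns once the data frame is loaded.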
rawdf.describe()
Quick check to see if there are any missing values at all:
print(rawdf.isnull().values.any())
The answer is yes, so let's see which columns have missing values:
# checking missing data
total = rawdf.isnull().sum().sort_values(ascending = False)
percent = 100*(rawdf.isnull().sum()/rawdf.isnull().count()).sort_values(ascending = False)
missing_data = pd.concat([total, percent, percent.cumsum()], axis=1, keys=['Total Missing', 'Percent', 'Cumulative Percent'])
missing_data.head(10)
Since the missing values are a very small percentage of the data, we can safely drop the rows that contain them. We also rename the data frame to df:
df = rawdf.dropna(axis=0 , how='any')
We do a quick check to make sure all rows with missing values have been dropped:
print(df.isnull().values.any())
# comparing sizes of data frames
print("Old data frame length:", len(rawdf), "\nNew data frame length:",
len(df), "\nNumber of rows with at least 1 NA value: ",
(len(rawdf)-len(df)))
The online source for this dataset is the 2016 Tabular Transportation Collision Data on the Open Ottawa website.
We've loaded the data into rawdf, which is essentially a table. Below is the Data Dictionary for the dataset:
data_dictionary = pd.DataFrame(index = df.columns, columns=['type', 'description'])
for name in df.columns:
    data_dictionary.loc[name, 'type'] = 'string'
data_dictionary.loc[df.columns[1],'type'] = 'float'
data_dictionary.loc[df.columns[2],'type'] = 'float'
data_dictionary.loc[df.columns[3],'type'] = 'date'
data_dictionary.loc[df.columns[4],'type'] = 'time'
data_dictionary.loc[df.columns[0],'description'] = 'Location description (RD1 @ RD2 or RD from RD1 to RD2)'
data_dictionary.loc[df.columns[1],'description'] = 'coordinate format is projected in MTM Zone 9, NAD83 (CSRS)'
data_dictionary.loc[df.columns[2],'description'] = 'coordinate format is projected in MTM Zone 9, NAD83 (CSRS)'
data_dictionary.loc[df.columns[3],'description'] = 'Date (m/d/yy)'
data_dictionary.loc[df.columns[4],'description'] = 'Time (hh:mm with hh from 00 to 24 and mm from 00 to 59)'
data_dictionary.loc[df.columns[5],'description'] = 'Environment (Clear, rain, snow, ...)'
data_dictionary.loc[df.columns[6],'description'] = 'Road surface condition (Ice, wet, dry snow, ...)'
data_dictionary.loc[df.columns[7],'description'] = 'Traffic control (stop, traffic signal, no control, ...) '
data_dictionary.loc[df.columns[8],'description'] = 'Collision location (Intersection, non-intersection, at/near private driveway)'
data_dictionary.loc[df.columns[9],'description'] = 'Light (daylight, dawn, dusk, ...)'
data_dictionary.loc[df.columns[10],'description'] = 'Classification of collision (non-fatal, fatal, property damage only)'
data_dictionary.loc[df.columns[11],'description'] = 'Initial impact type (Angle, turning movement, rear-end…) '
pd.set_option('display.max_colwidth', None)
data_dictionary
The main goal: understanding the data by exploring it and trying to spot unlikely and irregular patterns.
What are the features?
plt.figure(figsize=(10,100))
count = df['Location'].value_counts().head(500)
sns.barplot(x=count.values, y=count.index)
plt.xlabel('Number of Collisions', fontsize=12)
plt.title('Location', fontsize=15)
About 60 collisions have been recorded at the intersection of St. Joseph Blvd and Jean d'Arc Blvd, and more than 40 at the intersection of Hunt Club Rd and Riverside Dr. We can use the X and Y coordinates to draw a map of collisions. To do so, we first need to convert X and Y to floats, as they are originally recorded as strings. We do that with the following code:
df['X'] = df['X'].str.replace(',', '', regex=False).astype(float)
df['Y'] = df['Y'].str.replace(',', '', regex=False).astype(float)
Next, let's have a quick look at the X and Y scatter plot:
plt.scatter(df.X, df.Y)
It seems that there is an outlier in the X coordinate. Let's find the location of the outlier:
df[df.X > 1000000]
We remove the X outlier and plot the map of collisions:
dfxy = df.drop(axis = 0, index = df[df.X > 1000000].index)
import plotly.express as px
fig = px.scatter(dfxy, x="X", y="Y", color="Collision_Classification",
hover_data=['Location'])
fig.update_layout(autosize=False,
width=1400,
height=800,title_text="Hover over the points to see the location of collision")
fig.show()
Most collisions have been recorded in the downtown core, Orleans, Barrhaven and Kanata.
plt.figure(figsize=(10,100))
count = df['Time'].value_counts().head(500)
sns.barplot(x=count.values, y=count.index)
plt.xlabel('Number of Collisions', fontsize=12)
plt.title('Time', fontsize=15)
While the frequency of collisions for most recorded times is between 0 and 100, midnight (00:00) stands out with about 250 collisions. This is an outlier; the probable explanation is that an unknown time is also recorded as (00:00). A closer inspection of the other columns confirms this convention: for example, in the Road_Surface and Light columns, an unknown value is encoded as 00.
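To quantify this spike, we can simply count the records stamped 00:00. The sketch below runs on a hypothetical sample of the Time column (the counts are made up; on the real data the same expression would be applied to `df['Time']`):

```python
import pandas as pd

# Hypothetical sample of the Time column; '00:00' doubles as "time unknown".
times = pd.Series(['00:00', '08:15', '00:00', '17:45', '00:00'])

midnight = (times == '00:00').sum()
print(f'{midnight} records at 00:00 out of {len(times)}')
```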
From the statistics overview of the dataset, we know there are 1,338 unique recorded times. To reduce this and get a better view of time, we keep only the hour from the Time column and leave out the minutes. Then we view the histogram:
hours = df['Time'].str.split(':', expand=True).astype(int)[[0]]
hours
hours.hist(bins=24, range=(0, 24), histtype='bar')
plt.ylabel('Number of Collisions', fontsize=12)
plt.title('Time in 24 hours', fontsize=15)
As expected, collisions are recorded most frequently during rush hours (i.e., commuting to and from work), 8-9 AM and 3-6 PM.
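We can put a number on that observation by measuring what share of collisions falls in the rush-hour windows. This is a sketch on toy hour values (the 8-9 AM and 3-6 PM windows are the ones eyeballed from the histogram, not an official definition); on the real data, `hours` would come from the extraction in the cell above:

```python
import pandas as pd

# Toy hour-of-day values standing in for the extracted hours column.
hours = pd.Series([8, 9, 12, 15, 16, 17, 23, 8, 3, 16])

# Morning (8-9) and afternoon (15-18) rush-hour windows, inclusive.
rush = hours.between(8, 9) | hours.between(15, 18)
share = rush.mean()
print(f'{share:.0%} of these collisions fall in rush hours')
```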
plt.figure(figsize=(10,100))
count = df['Date'].value_counts()
sns.barplot(x=count.values, y=count.index)
plt.xlabel('Number of Collisions', fontsize=12)
plt.title('Date', fontsize=15)
More than 140 collisions were recorded on February 17, 2016. Now let's see which month has the most recorded collisions:
months = df['Date'].str.split('/', expand=True).astype(int)[[0]]
months
months.hist(bins=12, range=(1, 13), histtype='bar')
plt.ylabel('Number of Collisions', fontsize=12)
plt.title('Date in Months', fontsize=15)
The frequency of collisions increases in the fall and winter. The most collisions were recorded in December; could it be that roads are busier closer to the Christmas holidays?
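One way to check the seasonal pattern is to aggregate the month numbers into seasons. This sketch uses toy month values and a simple month-to-season mapping (a rough meteorological grouping, chosen here for illustration); on the real data, `months` would come from the Date split above:

```python
import pandas as pd

# Toy month numbers standing in for the extracted months column.
months = pd.Series([1, 2, 6, 7, 11, 12, 12, 12])

# Rough meteorological seasons; any month not listed falls into 'Fall'.
season_of = {12: 'Winter', 1: 'Winter', 2: 'Winter',
             3: 'Spring', 4: 'Spring', 5: 'Spring',
             6: 'Summer', 7: 'Summer', 8: 'Summer'}
season = months.map(lambda m: season_of.get(m, 'Fall'))
print(season.value_counts())
```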
plt.figure(figsize=(10,5))
count = df['Environment'].value_counts()
sns.barplot(x=count.values, y=count.index)
plt.xlabel('Number of Collisions', fontsize=12)
plt.title('Environment', fontsize=15)
Collisions most often happen in a clear environment! While the frequency of collisions for most environment types is between 0 and 2,000, it is more than 10,000 for a clear environment, which stands out as an outlier.
plt.figure(figsize=(10,5))
count = df['Road_Surface'].value_counts()
sns.barplot(x=count.values, y=count.index)
plt.xlabel('Number of Collisions', fontsize=12)
plt.title("Road Surface", fontsize=15)
Collisions most often happen when the road surface is dry! While the frequency of collisions for most road-surface types is between 0 and 2,000, it is more than 8,000 when the road surface is dry, which stands out as an outlier.
plt.figure(figsize=(10,5))
count = df['Traffic_Control'].value_counts()
sns.barplot(x=count.values, y=count.index)
plt.xlabel('Number of Collisions', fontsize=12)
plt.title('Traffic Control', fontsize=15)
Collisions most frequently happen where there is no traffic control, which is not surprising.
plt.figure(figsize=(10,5))
count = df['Collision_Location'].value_counts()
sns.barplot(x=count.values, y=count.index)
plt.xlabel('Number of Collisions', fontsize=12)
plt.title('Collision Location', fontsize=15)
Collisions most often happen in daylight! While the frequency of collisions for most light conditions is between 0 and 3,000, it is more than 9,000 in daylight, which stands out as an outlier.
plt.figure(figsize=(10,5))
count = df['Light'].value_counts()
sns.barplot(x=count.values, y=count.index)
plt.xlabel('Number of Collisions', fontsize=12)
plt.title('Light', fontsize=15)
plt.figure(figsize=(10,5))
count = df['Collision_Classification'].value_counts()
sns.barplot(x=count.values, y=count.index)
plt.xlabel('Number of Collisions', fontsize=12)
plt.title('Collision Classification', fontsize=15)
Luckily, most collisions are recorded as property damage only. A few involve fatal injuries. We can take a closer look at the collisions resulting in fatal injuries:
dfCC = df['Collision_Classification'].str.split('-', expand=True)[[0]].astype(int)
df_fatal = df[dfCC[0] == 1]
df_fatal.shape[0]
23 fatal injuries have been recorded. We can find their locations on the map (note that we can toggle each collision classification on and off by clicking its legend entry):
import plotly.express as px
fig = px.scatter(dfxy, x="X", y="Y", color="Collision_Classification",
hover_data=['Location'])
fig.show()
plt.figure(figsize=(10,5))
count = df['Impact_type'].value_counts()
sns.barplot(x=count.values, y=count.index)
plt.xlabel('Number of Collisions', fontsize=12)
plt.title('Impact type', fontsize=15)
import plotly.express as px
fig = px.scatter(dfxy, x="X", y="Y", color="Impact_type",
hover_data=['Location'])
fig.show()
Now let's look at the pairwise correlation of all features with one another. It helps us decide which bivariate distributions to investigate.
To measure pairwise correlation between our categorical features, inspired by the article The Search for Categorical Correlation, we use the associations function from the dython package developed by Shaked Zychlinski.
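For intuition, here is a minimal sketch of Cramér's V, the chi-squared-based statistic that associations computes for pairs of nominal features (dython's own version adds a bias correction; this plain form is only for illustration):

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

def cramers_v(x, y):
    """Plain (bias-uncorrected) Cramér's V between two categorical series."""
    confusion = pd.crosstab(x, y)
    # correction=False disables Yates' continuity correction so that a
    # perfectly associated 2x2 table yields V = 1.
    chi2 = ss.chi2_contingency(confusion, correction=False)[0]
    n = confusion.to_numpy().sum()
    r, k = confusion.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

# Toy example: two perfectly associated columns give V = 1.
a = pd.Series(['stop', 'signal', 'stop', 'signal'])
b = pd.Series(['x', 'y', 'x', 'y'])
print(round(cramers_v(a, b), 3))
```

V ranges from 0 (no association) to 1 (perfect association), which is what makes the heatmap below readable like a correlation matrix.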
from dython.nominal import associations
categorical = df.drop(['X', 'Y'], axis=1 )
categorical.head()
associations(categorical, nominal_columns='all', figsize=(10,10))
The most correlated features are Location and Traffic_Control. Below is the collision map colored by traffic control. We get almost the complete map if we activate only the No Control case.
import plotly.express as px
fig = px.scatter(dfxy, x="X", y="Y", color="Traffic_Control",
hover_data=['Location'])
fig.show()
Almost 50% of collisions occur where there is no traffic control.
import plotly.graph_objects as go
fig = go.Figure(go.Pie(
values = df.Traffic_Control.value_counts().tolist(),
labels = df.Traffic_Control.value_counts().index,
texttemplate = "%{label}: %{value} <br>(%{percent})",
textposition = "inside"))
fig.show()
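The same percentages can be read off directly with a normalized value count, without a chart. This is a sketch on toy Traffic_Control values (the real check would be `df.Traffic_Control.value_counts(normalize=True)`):

```python
import pandas as pd

# Hypothetical sample of the Traffic_Control column.
tc = pd.Series(['No control', 'Traffic signal', 'No control',
                'Stop sign', 'No control', 'Traffic signal'])

# normalize=True returns fractions instead of raw counts.
shares = tc.value_counts(normalize=True)
print(shares.round(2))
```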
The source for the dataset is reliable, though we didn't check that all the entries in our dataset match the source dataset. However, a quick comparison of the collision map created from our data with the source map shows that they are compatible. The only misleading entry is midnight in the Time column, mentioned in the previous section: (00:00) does not always mean midnight; it can also mean the time is unknown.
We mentioned potential outliers in each column during the univariate/bivariate analysis. The only outlier that was problematic, and that we had to remove, was the one with an X coordinate far larger than all other entries. Although it is a valid entry for the city of Ottawa (i.e., the location exists), it lies far to the east.
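Instead of the fixed 1,000,000 cutoff used above, such an outlier can also be flagged automatically with the standard interquartile-range rule. This sketch uses toy X coordinates with one far-east point (the values are made up; the 1.5 multiplier is the conventional default, not tuned to this dataset):

```python
import pandas as pd

# Toy X coordinates mimicking the outlier we removed.
x = pd.Series([367_000.0, 368_500.0, 366_200.0, 369_100.0, 1_200_000.0])

# IQR rule: anything beyond 1.5 * IQR from the quartiles is flagged.
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print(outliers)
```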